Session Summary: How BMW Group uses AWS serverless analytics for a data-driven ecosystem #ANT310 #reinvent
This post is a session report on ANT310: How BMW Group uses AWS serverless analytics for a data-driven ecosystem at AWS re:Invent 2020.
The Japanese version of this post is available here.
Abstract
Data is the lifeblood fueling BMW Group’s digital transformation. It drives BMW’s personalized customer experiences, connected mobility solutions, and analytical insights. This session walks through the journey of building BMW Group’s Cloud Data Hub. BMW Group’s technical lead, Simon Kern, dives deep into how the company is leveraging AWS serverless capabilities for delivering ETL functions on big data in a modularized, accessible, and repeatable fashion and provides insight into the next steps of the journey. The services used in BMW Group’s AWS architecture are discussed, including AWS Glue, Amazon Athena, Amazon SageMaker, and more.
Speakers
- Simon Kern
- Lead DevOps Engineer - BMW Group
How BMW Group uses AWS serverless analytics for a data-driven ecosystem - AWS re:Invent 2020
Content
- BMW Group IT: Brief intro
- Cloud Data Hub: BMW Group’s central data lake
- Orchestrating data
- Ingesting and preparing data
- Analyzing data
- Outlook
BMW Group IT: Brief intro
BMW Group is a global mobility company operating in 29 countries, with employees of 60 nationalities. It can also be called an IT company: 694 locations are connected through its global IT network, and it delivers over 230 software products. One of the most important services is BMW’s ConnectedDrive backend, to which over 14 million vehicles are connected and which serves over 1 billion requests per day.
- -> BMW Group produces a lot of data with these backend systems.
- -> Ingest the data into a data lake to organize and analyze it together
- -> Cloud Data Hub
Cloud Data Hub
- Cloud native data lake that makes it easy to...
- Ingest data
- Have a scalable storage solution
- Opens up many possibilities to get value out of the data
- Left Side
- Over 500 software and data engineers
- Build data ingests and data preparations to fuel a data marketplace
- Right Side
- Over 5,000 business analysts and data scientists
- Build use cases, machine learning models and AI products
- -> Data democratization: easy to access all the data in the BMW Group
You can work with the data seamlessly from the portal below.
The internal architecture consists of three pillars: the data providers, the data consumers, and the data portal and APIs. All of them are built on a multi-tiered AWS account setup.
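As a rough illustration of that multi-account setup (not BMW’s actual code), a consumer account typically assumes a role in a hub account before touching shared data. The role ARN and bucket name below are hypothetical.

```python
import boto3

# Hypothetical role and bucket names, for illustration only.
HUB_READ_ROLE_ARN = "arn:aws:iam::111111111111:role/cdh-hub-read-access"
HUB_BUCKET = "cdh-hub-example-datasets"

sts = boto3.client("sts")

# Assume a read role in the hub account from a consumer account.
creds = sts.assume_role(
    RoleArn=HUB_READ_ROLE_ARN,
    RoleSessionName="use-case-analytics",
)["Credentials"]

# Use the temporary credentials to list objects in the hub's dataset bucket.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
response = s3.list_objects_v2(Bucket=HUB_BUCKET, Prefix="prepared/", MaxKeys=10)
for obj in response.get("Contents", []):
    print(obj["Key"])
```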
Orchestrating data
- Data Providers (Left Side)
- Global IT units that provide central datasets
- Local IT units
- Use Cases (Right Side)
- Controlled by the access management layer, which determines which datasets they are allowed to see
- Global use cases use global datasets
- US use cases use global and local datasets
- Data portal and API (Center)
- Important for customers to...
- Explore and query the data
- Manage metadata
- Deploy infrastructure
- Built on top of several APIs
- Security, central compliance services
- Single sign-on for all users
- Gray boxes
- Separate markets and legal entities into different hubs with their own storage accounts and processes.
- Unified seamless front end
- Dataset
- A combination of S3 buckets and Glue databases (see the sketch after this list)
- Always lives inside the Hub
- Assigned to a business object which categorizes the data into a separate unit
- 3 types of layers, and every dataset belongs to one of them
- Source: a copy of the source system
- Prepared: cleaned and harmonized data
- Semantic: data enriched by aggregations or joins
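To make the dataset concept more concrete, here is a minimal, hypothetical sketch of the catalog side of a dataset: a Glue database whose location points at an S3 prefix. The naming scheme (hub, business object, layer) is only an assumption based on the description above, not BMW’s actual convention.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical naming for a prepared-layer dataset of a "sales orders" business object.
database_name = "cdh_sales_orders_prepared"
s3_location = "s3://cdh-hub-example-datasets/sales-orders/prepared/"

# A dataset is roughly "S3 bucket/prefix + Glue database": create the catalog side here.
glue.create_database(
    DatabaseInput={
        "Name": database_name,
        "Description": "Prepared-layer dataset for the sales orders business object",
        "LocationUri": s3_location,
    }
)
```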
Ingesting and preparing data
- Key concepts of data ingestion
- Ease of use
- Ingestion kickstart via a UI-accessible, TypeScript-based CDK stack
- Advanced features can be leveraged via Terraform modules
- Flexibility
- Specialized building blocks
- Reusability via modularization
- Maintainability
- Community via internal open source
- Bigger changes via forks
- Ease of use
- Ingestion from on-premises systems into the CDH Core account (see the sketch after this list)
- All setup via Terraform
- Glue ETL
- Running in a private VPC
- Reads and pulls data from the on-premises network
- Store it in the central S3 bucket of Cloud Data Hub
- Secrets Manager
- Store database credentials
- CloudWatch
- Logging and triggering Glue jobs
- Glue Data Catalog
- Lambda syncs catalogs in Provider and CDH Core
- Independent security account
- Store KMS keys
- PII API
- Encrypt sensitive data
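A heavily simplified sketch of what such an ingest job could look like as a Glue PySpark script is shown below: database credentials come from Secrets Manager, the table is pulled over JDBC from the on-premises network, and the raw copy is written into the central S3 bucket. The secret name, JDBC details, and S3 path are made up for illustration.

```python
import json
import boto3
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Fetch on-premises database credentials from Secrets Manager (hypothetical secret name).
secrets = boto3.client("secretsmanager")
secret = json.loads(
    secrets.get_secret_value(SecretId="cdh/ingest/source-db")["SecretString"]
)

# Pull a table from the on-premises database over JDBC (the job runs inside a private VPC,
# and the matching JDBC driver must be available to the Glue job).
source_df = (
    spark.read.format("jdbc")
    .option("url", secret["jdbc_url"])
    .option("dbtable", "sales.orders")
    .option("user", secret["username"])
    .option("password", secret["password"])
    .load()
)

# Store the raw copy in the central Cloud Data Hub bucket (source layer, hypothetical path).
source_df.write.mode("append").parquet(
    "s3://cdh-core-central-bucket/sales-orders/source/"
)
```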
- Data preparation
- Set up via another Terraform module
- Read data from the central S3 Bucket (source layer)
- Write it into the prepared-layer datasets (see the sketch below)
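Along the same lines, here is a minimal plain-PySpark sketch of a preparation step: read the source-layer copy, clean and harmonize it, and write it into the prepared-layer dataset. Column names and paths are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read the raw copy from the source layer (hypothetical path).
source_df = spark.read.parquet("s3://cdh-core-central-bucket/sales-orders/source/")

# Clean and harmonize: drop duplicates, normalize column names and types.
prepared_df = (
    source_df.dropDuplicates(["order_id"])
    .withColumn("order_date", F.to_date("order_date"))
    .withColumnRenamed("cust_id", "customer_id")
)

# Write into the prepared-layer dataset.
prepared_df.write.mode("overwrite").parquet(
    "s3://cdh-hub-example-datasets/sales-orders/prepared/"
)
```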
- Ingestion challenges
- S3 object ownership problem caused by multi accounts
- The object ownership doesn't match the bucket ownership
- Bucket policies don't apply, and the IAM role setup for Glue jobs gets cumbersome
- A recent S3 update means the role switching is no longer necessary
- Job sizing
- Choose the right number of DPUs
- Hard to automate, so it is based on best practices
- Lightweight ETL via Spark orchestrated on AWS Fargate
- Small files
- Built a compaction module running on Glue and Athena (see the sketch after this list)
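The core of such a compaction module could look like the following PySpark sketch, which rewrites a partition full of small files into a few larger ones under a staging prefix. The path and target file count are hypothetical, and this is not BMW’s actual module.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a partition that has accumulated many small files (hypothetical path).
partition_path = "s3://cdh-core-central-bucket/sales-orders/source/ingest_date=2020-12-01/"
df = spark.read.parquet(partition_path)

# Compact into a small number of larger files under a staging prefix,
# which can then replace the original partition.
compacted_path = partition_path.rstrip("/") + "_compacted/"
df.repartition(8).write.mode("overwrite").parquet(compacted_path)
```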
- Ingestion recap
- Reusable building blocks for common tasks
- Multi-account setup
- Infrastructure isolation
- Scale-out to the whole organization
- Team empowerment
- >150 systems ingested
- >1 PB data volume total
- ~ 100 TB data volume via PySpark-based ETL
- The rest of the data volume via stream-based ingests
Analyzing data
- Analyses via
- Amazon Athena (a minimal query sketch is shown below)
- Amazon SageMaker (optionally with Amazon EMR or AWS Glue ETL development endpoints)
- Amazon QuickSight
- Challenge: Moving from exploration to production for non-experts
- Easy integration into CI/CD for ingestion and transformation
- Managed environment for creating new versions of data products
If you'd like to see the analysis demo, please check out the actual session.
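As a taste of the Athena-based exploration, a minimal boto3 sketch for running a query against a Cloud Data Hub dataset might look like this (database, table, and result bucket are hypothetical):

```python
import time
import boto3

athena = boto3.client("athena")

# Start an exploratory query against a prepared-layer dataset (hypothetical names).
execution = athena.start_query_execution(
    QueryString="SELECT customer_id, COUNT(*) AS order_count FROM orders GROUP BY customer_id LIMIT 10",
    QueryExecutionContext={"Database": "cdh_sales_orders_prepared"},
    ResultConfiguration={"OutputLocation": "s3://cdh-athena-query-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Print the result rows (the first row contains the column headers).
if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```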
- Architecture
- The central data portal and Bitbucket roll out the different CodePipeline pipelines
- Exploration: explore the data and build transformation code
- Development: verify the transformation with Athena
- Production: deploy after verification
- Data analysis recap
- Enable non-experts via managed toolstack
- Built-in best practices
- Predefined path into production
- Empower experts to utilize the full power of AWS services
Outlook
- Integration of AWS Lake Formation for fine-grained access control
- Fine-grained data lineage via Spark execution plans (see the sketch below)
- Automated data monitoring (including frequent users, dataset updates, health, and statistics)
- Query acceleration layer for better performance with established BI tools
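On the lineage item: PySpark exposes a job’s execution plans, which contain the input paths and operations that lineage could be derived from. The snippet below only shows how to print those plans; whether BMW extracts lineage this way is not covered in the session.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical transformation whose lineage we want to inspect.
df = spark.read.parquet("s3://cdh-core-central-bucket/sales-orders/source/")
aggregated = df.groupBy("customer_id").count()

# Print the parsed, analyzed, optimized, and physical plans; these reference the
# input paths and operations, which is the raw material for lineage extraction.
aggregated.explain(True)
```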
AWS re:Invent 2020 is being held now!
Want to watch the whole session? Jump in and sign up for AWS re:Invent 2020!